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Abstract 

We explore the use of generalized t priors on regression coefficients to help understand 
the nature of association signal within "hit regions" of genome- wide association studies. The 
particular generalized t distribution we adopt is a Student distribution on the absolute value 
of its argument. For low degrees of freedom we show that the generalized t exhibits 'sparsity- 
prior' properties with some attractive features over other common forms of sparse priors and 
includes the well known double-exponential distribution as the degrees of freedom tends to 
oo. We pay particular attention to graphical representations of posterior statistics obtained 
from sparsity-path-analysis (SPA) where we sweep over the setting of the scale (shrinkage 
/ precision) parameter in the prior to explore the space of posterior models obtained over 
a range of complexities, from very sparse models with all coefficient distributions heavily 
concentrated around zero, to models with diffuse priors and coefficients distributed around 
their maximum likelihood estimates. The SPA plots are akin to LASSO plots of maximum 
a posteriori (MAP) estimates but they characterise the complete marginal posterior distri- 
butions of the coefficients plotted as a function of the precision of the prior. Generating 
posterior distributions over a range of prior precisions is computationally challenging but 
naturally amenable to sequential Monte Carlo (SMC) algorithms indexed on the scale pa- 
rameter. We show how SMC simulation on graphic-processing-units (CPUs) provides very 
efficient inference for SPA. We also present a scale-mixture representation of the general- 
ized t prior that leads to an EM algorithm to obtain MAP estimates should only these be 
required. 



1 Introduction 



Genome-wide association studies (GWAS) have presented a number of interesting challenges 
to statisticians [1]. Conventionally GWAS use single marker univariate tests of association, 
testing marker by marker in order to highlight "hit regions" of the genome showing evidence of 
association signal. The motivation for our work concerns methods in Bayesian sparse multiple 
regression analysis to characterise and decompose the association signal within such regions 
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spanning possibly hundreds of markers. Hit-regions typically cover loci in high linkage disequi- 
librium (LD) which leads to strong correlation between the markers making multiple regression 
analysis non-trivial. A priori we would expect only a small number of markers to be respon- 
sible for the association and hence priors that induce sparsity are an important tool to aid 
understanding of the genotype-phenotype dependence structure; an interesting question being 
whether the association signal is consistent with a single causal marker or due to multiple effects. 
We restrict attention to case-control GWAS, i.e. binary phenotypes, these being by far the most 
prevalent although generalisations to other exponential family likelihoods or non-linear models 
is relatively straightforward. 

In the non-Bayesian literature sparse regression analysis via penalised likelihood has gained 
increasing popularity since the seminal paper on the LASSO [2, 3]. The LASSO estimates have 
a Bayesian interpretation as maximum a posteriori (MAP) statistics under a double exponen- 
tial prior on regression coefficients 7r(/3) oc exp(— A^^- and it is known that, unlike ridge 
penalties using a normal prior, the Lasso prior tends to produce 'sparse' solutions in that as 
an increasing function of the regulariation penalty A more and more of the MAP estimates will 
be zero. From a Bayesian perspective the use of MAP estimates holds little justification, see 
the discussion of [3], and in [4] they explore full Bayesian posterior analysis with Lasso double- 
exponential priors using Makov chain Monte Carlo (MCMC) to make inference. While the Lasso 
penalty / prior is popular there is increasing awareness that the use of identical 'penalization' 
on each coefficient, e.g. A Yl^=i l/^j I? can lead to unacceptable bias in the resulting estimates [5]. 
In particular coefficients get shrunk towards zero even when there is overwhelming evidence in 
the likelihood that they are non-zero. This has motivated use of sparsity-inducing non-convex 
penalties which reduce bias in the estimates of large coefficients at the cost of increased diffi- 
culty in computing. Of note we would mention the "adaptive" methods [6, 7] in the statistics 
literature and iteratively reweighted methods [8, 9] in the signal processing literature; although 
these papers are only concerned with MAP estimation there being no published articles to date 
detailing the use of full Bayesian analysis. 

In the context of GWAS there have been recent illustrations of the use of adaptive sparsity 
priors and MAP estimation [10], reviewed and compared in [11]. In [10] they use the Normal- 
Exponential-Gamma sparsity-prior of [12] to obtain sparse MAP estimates from logistic regres- 
sion. The use of continuous sparse priors can be contrasted with an alternative approach using 
two component mixture priors such as in [13] which adopt a distribution containing a spike at 
zero and another component with broad scale. Coefficients are then a posteriori classified into 
either state, relating to whether their corresponding variables (genotypes) are deemed predic- 
tively irrelevant or relevant. In the GWAS setting the two-component mixture prior has been 
explored by [14, 15]. For model exploration the sparsity prior has some benefits in allowing the 
statisticians to visualize the regression analysis over a range of scales / model complexities. 

In this paper we advocate the use of the generalized t prior, first considered in [16], on the 
regression coefficients and a full Bayesian analysis. The generalized t prior we adopt is a 
Student distribution on the absolute value of the coefficient. We demonstrate it has a number 
of attractive features as a 'sparsity prior' in particular due to its simple analytic form and 
interpretable parameters. In applications the setting of the scale parameter in the prior is the 
key task that affects the sparsity of the posterior solution. We believe much is to be gained in 
exploring the continuum of posterior models formed by sweeping through the scale of the prior, 
even over regions of low posterior probability, in an exploratory approach we term sparsity- 
path-analysis (SPA). This leads towards providing graphical representations of the posterior 
densities that arise as we move from the most sparse models with all coefficient densities heavily 
concentrated around the origin through to models with diffuse priors and coefficients distributed 
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around their maximum likelihood estimates. We stress that even though parts of this model 
space have low probability, they have utility and are interesting in gaining a fuller understanding 
of the predictor-response (genotype-phenotype) dependence structure. SPA is an exploratory 
tool that is computationally challenging but highly suited to sequential Monte Carlo (SMC) 
algorithms indexed on the scale parameter and simulated in parallel on graphics cards using 
graphical processing units (CPUs). 

The generalized t as a prior for Bayesian regression coefficients was first suggested by [16] and 
more recently in the context of sparse MAP estimation for normal (Gaussian) likelihoods in 
[17, 18, 19], and using Markov chain Monte Carlo for a full Bayesian linear regression analysis 
in [19]. In [18, 19] the prior is referred to as a double-pareto distribution but we prefer to use 
the terminology of the generalized t as we feel it is more explicit and easier to interpret as 
such. Our contribution in this paper is to consider a full Bayesian analysis of logistic regression 
models with generalized t priors in the context of GWAS applications for which we develop SMC 
samplers and GPU implementation. The GPU implementation is important in practice, in the 
GWAS analysis in reduces run-time from over five days to hours. We pay particular attention 
to developing graphical displays of summary statistics from the joint posterior distribution over 
a wide range of prior precisions; arguing that parts of the model space with low probability 
have high utility and are informative to the exploratory understanding of regression signal. 

In Section 2 we introduce the generalized t prior and show that it has a scale-mixture rep- 
resentation and exhibits sparsity type properties for regression analysis. Section 3 provides a 
brief description of the genetic association data and simulated phenotypes used in our methods 
development. Section 4 concerns computing posterior distributions over a range of prior scale 
parameters using sequential Monte Carlo algorithms simulated on CPUs. Section 5 deals with 
graphical display of SPA statistics and plots of posterior summary statistics over the path of 
scales. We conclude with a discussion in Section 6. 



2 Generalized t prior for sparse regression analysis 

In Bayesian logistic regression analysis given a data set {X,y}, of (n x p) predictor matrix X 
and (n x 1) vector of binary phenotypes (or response variables) y, we adopt a prior p{f3) on the 
p regression coefficients from which the log-posterior updates as, 

logp(/3|X, y, a, c) oc log /(y |X, f3, 0) + logp(;9) 

where /(•) is the likelihood function, 

n 

/(y|x,^) = X{{Pir{i-Pif-y' 

i=l 

Pi = logit~^{xi(3) 

In this paper we propose the generalized t distribution as an interesting prior for /3, which has 
the form of a Student density on the L^-norm of its argument, [16], 

a, c, q) = rr\-, ^ 1 + - — ^ (1) 

where we note that the usual Student's density is obtained with = 2, c = a/2}; the exponential 
power family of distributions arise as degrees of freedom a ^ oo; and the double-exponential 
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or Laplace distribution occurs with = 1, a = oo}. The use of this distribution as a prior on 
regression coefficients was first described in [16]. 

We shall only consider the Li-norm case with central location = 0, and independent priors 
across the p coefficients for which we can simplify the distribution to, 

p{f3j\fi = 0, a, c, g = 1) = - h + ^ J (2) 

where 6 > is a scale parameter and a > is related to the degrees of freedom; we will 
denote this centred Li generalized t as Gt{a,c). Note that [18, 19] refer to this distribution 
as a double-pareto but we prefer to think of it as an instantiation of a generalized t following 
the earlier derivation in the context of priors on regression coefficients in [16]. This also allows 
for easier understanding of the parameters and direct connections with the double-exponential 
distribution and other standard densities. 



It is interesting to note that (2) can be derived from the marginal of a scale-mixture of double- 
exponential densities, such that, for multivariate /3 = {/3i, . . . ,/3p}, where, 

p{/3) = llpmr,) 

j 

Tj - IG{a,b) (3) 
where •) denotes the Inverse- Gamma density, from this we find, 

p{Pj) = / p{xj\rj)p{Tj\a^b)dTj = Gt{a^b/a). 

where we can set 6 = ac to generate a marginal Gt(a, c) density. Hence as a prior on the set of 
regression coefficients, one way to think of (2) is as defining a hierarchical prior whereby each 
coefficient has a double-exponential prior with a coefficient specific scale parameter distributed 
a priori as /G(a,ac). A plot of the log densities for Gt(l,l), and the double-exponential, 
DE(0^ 1) is shown in Figure 1 where we can see the greater Kurtosis and heavier tails provided 
by the generalized t. As mentioned in the Section 1, the relatively light tails of the DE{') prior 
is unattractive as it tends to shrink large values of the coefficients even when there is clear 
evidence in the likelihood that they are predict ively important. 

We are interested with inferring and graphing the regression coefficents' posterior distributions 
over a range of values for c for fixed degrees of freedom a, an exploratory procedure we call 
sparsity-path-analysis. But first it is instructive to examine the form of the MAP estimates 
obtained under the generalized t prior as this sheds light on their utility as sparsity priors. 



2.1 MAP estimates via the EM algorithm 



The optimization problem associated with obtaining the MAP estimates for j3 under the gen- 
eralized t-distribution prior is not concave. However, one can find local modes of the posterior 
by making use of the scale-mixture representation (3) and using the EM algorithm with the 
hierarchical scale parameters r = ri-^p as latent variables. Indeed, each iteration of EM takes 
the form 

/3(*+i)=argmaxlog/(y|X,^)+ / \og\p{l3\T)]p{T\l3^'\a,h)dT 
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where b = ac. The conjugacy of the inverse-gamma distribution with respect to the Laplace 
distribution gives 

T,\pf\aj,bj^IG{aj + l,b, + \p,\) 

allowing for different prior parameters for each coefficient, and with p(/3|r) = Y\^=iP{f3j\Tj) = 
n^^il/(2r,)exp(-|/3,|/r,) yields 

/3(*+i)=argmaxlog/(y|X,/3,^)-^|/3,-| / -p{rj\l3f\aj,bj)dTj 

where the expectation of I/tj given tj ~ IG{aj + 1, bj + is {aj + l)/{bj + 

As such, one can find a local mode of the posterior p(/3|y, X, /3, 0) by starting at some point 
/3^^^ and then iteratively solving 

p 

f3^'^'^ = argmaxlog/(y|X,/3,^) -^wf^\^j\ (4) 

where 



w 



aj + 1 



For logistic regression the update (4) is a convex optimization problem for which standard 
techniques exist. We can see the connect: 
where A = 1/c is the usual Lasso penalty. 



techniques exist. We can see the connection to the Lasso as with a ^ oc we find w^f^ — 



2.2 Oracle properties of the MAP 



In the penalized optimization literature, some estimators are justified at least partially by their 
possession of the oracle property: that for appropriate parameter choices, the method performs 
just as well as an oracle procedure in terms of selecting the correct predictors and estimating 
the nonzero coefficients correctly asymptotically in n when the likelihood is Gaussian. Other 
related properties include asymptotic unbiasedness, sparsity and continuity in the data [5]. 
Penalization schemes with the oracle property include the smoothly clipped absolute deviation 
method [5] , the adaptive lasso [6] and the one-step local linear approximation method [7] . 

For the generalized t-distribution, the MAP estimate obtained using this prior has been exam- 
ined in [17, 18, 19], all of which derive the expectation-maximisation algorithm for finding a 
local mode of the posterior. It is straightforward to establish that the estimate is asymptoti- 
cally unbiased, sparse when c < 2^ {a + l)/a and continuous in the data when c = ^(a -f- l)/a 
following the conditions laid out in [5]. [19] shows that this scheme satisfies the oracle property 
if simultaneously a ^ oo, a/ y^^) ~^ acy^n) ^ (7 < oo for some constant C. 

It is worth remarking that the oracle property requires the prior on (3 to depend on the number 
of observations n. This is in conflict with conventional Bayesian analysis in that the prior 
should represent your beliefs about the data generating mechanism irrespective of the number 
of observations that you may or may not receive. In addition in the context of applications your 
n and p may well not be in the realm of asymptotic relevance. 
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2.3 Comparison to other priors 



There have been a number of sparsity prior representations advocated recently in the hterature, 
although it is important to note that most of the papers deal solely with MAP estimation and not 
full Bayesian analysis. To name a few important contributions we highlight the normal-gamma, 
normal-exponential-gamma (NEG) and the Horseshoe priors, see [20, 21, 12, 22]. Like all these 
priors the shares the properties of sparsity and adaptive shrinkage, allowing for less 

shrinkage on large coefficients as shown in Fig. 2 for a normal observation, and a scale-mixture 
representation which facilitates an efficient EM algorithm to obtain MAP estimates. 

In Table 1 we have listed the log densities for the sparsity-priors mentioned above. From Table 
1 we see an immediate benefit of the generalized t density is that it has a simple analytic form 
compared with other sparsity-priors and is well known and understood by statisticians, which 
in turn aids explanation to data owners. The generalized t is also easy to compute relative 
to the NEG, normal-gamma and Horseshoe priors which all involve non-standard computation, 
making them more challenging to implement and less amenable to fast parallel GPU simulation. 
Finally, the setting of the parameters of the generalized t is helped as statisticians are used to 
thinking about degrees of freedom and scale parameters of Student's densities. We believe this 
will have an impact on the attractiveness of the prior to non-expert users outside of sparsity 
researchers. 





oc logp(/3j) 


double-exponential 




normal-gamma 
NEG 

Horseshoe 

Gt{a, b) 


(i-A)l0g|^,|-l0gi^;,_l/2 (^) 

log [/o~(2r2A2)-V2exp ( (1 + Xy^^d}}^ 
-(a + l)log(l + ^) 



Table 1: Log densities for the double-exponential and various sparsity-priors. K{') denotes a 
Bessel function of the third kind and Dy{z) is the parabolic cylinder function. The Horseshoe 
prior does not have an analytic form. 



3 Genotype Data 

In the development and testing of the methods in this paper we made use of anonymous genotype 
data obtained from the Wellcome Trust Centre for Human Genetics, Oxford. The data consists 
of a subset of single-neucleotide polymorphisms (SNPs) from a genome- wide data set originally 
gathered in a case-control study to identify colorectal cancer risk alleles. The data set we 
consider consists of genotype data for 1859 subjects on 184 closely spaced SNPs from a hit- 
region on chromosome 18q. We standardise the columns of X to be mean zero and unit standard 
deviation. A plot of the correlation structure across the first 100 markers in the region is shown 
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in Fig 3. We can see the markers are in high LD with block like strong correlation structure 
that makes multiple regression analysis challenging. 

We generated pseudo-phenotype data, y, by selecting at random five loci and generating co- 
efficients, /3 ~ A^(0,0.2), judged to be realistic signal sizes seen in GWAS [23]. All other 
coefficients were set to zero. We then generated phenotype, case-control, binary y data by sam- 
pling Bernoulli trials with probability given by the logit of the log-odds, for an individuals 
genotype x^. We considered two scenarios. The first when we construct a genotype matrix 
of 500 individuals using the five relevant markers and an additional 45 "null" markers whose 
corresponding coefficients were zero. The second data set uses all available markers and all 
subjects to create a (1859 x 184) design matrix. 



4 Sequential Monte Carlo and graphic card simulation 



We propose to explore the posterior distributions of /3|X,y,a,c with varying c = h/a via 
Sequential Monte Carlo (SMC) sampler methodology [24]. In particular, we let the target 
distribution at time t be 7rt(/3) = p(/3|X, y, a, 6t/<^) with t — {1,...,T} and some specified 
sequence {ht\J^i where hi > bi^i and is small enough for the prior to dominate the likelihood 
with all coefficients posterior distributions heavily concentrated around zero, and bi large enough 
for the prior to be diffuse and have limited impact on the likelihood. In order to compute 
posterior distributions over the range of {bj}J^^ we make use of sequential Monte Carlo samplers 
indexed on this sequence of scales. 



It is important to note that we cannot evaluate 

p{y\X,fi)pifi\a,bt) 



vrt(/3) 



Jpiy\X,l3)p{f3\a,bt)df3 



since we cannot evaluate the integral in the denominator, the marginal likelihood of y. However, 
we can evaluate 7t(/3) = Zt7rt{f3) where Zt = J p(y|X,/3)p(/3|a, 6t)(i/3. 

At time t we have a collection of particles {/3i''^}fLi and accompanying weig hts {wi'^jfL^ 
such that the empirical distribution 

N 

i=l * 

is an approximation of vr^. The weights are given by 
where 

We use an Markov chain Monte Carlo (MCMC) kernel Kt with invariant distribution TVt to 
sample each particle /3[^\ ie. ^ Kt{f3^\^ •). The backwards kernel Lt-i used to weight the 
particles is essentially arbitrary but we use the reversal of Kt 

U-i{Pu Pt-i) = — 
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so that the weights in (6) are given by 

_i 1 ; — 

as suggested in [25, 26, 24]. 



Since the weights do not depend on the value of /3t, it is better to resample the particles before 
moving them according to Kt, when resampling is necessary. We resample the particles when 
the effective sample size [27] is below 3/4A^. Also note that even if we knew the normalizing 
constants Zt-i and Z^, they would have no impact since the weights are normalized in (5). We 
can however obtain an estimate of Zf/Zf-i at time t via 



- = / (7) 
-1 J jt-i[Pt-i) 

«^W,»^,(/3a,/3«) = /^ (8) 



which allows one to estimate the unnormalized marginal density Zt/Zi for each value of bt via 



Zt _ Yjt Zt 



Zi ~ 1 h=2 Zt. 



If we additionally give log 6 a uniform prior, then these unnormalized marginal densities can 
also be used to weight the collections of particles at each time, allowing us to approximate the 
distribution of /3 with b marginalized out. 



T ^ N 



Note that this is equivalent to systematic sampling of the values of b. 



4.1 Further Details 



For the experiments we used the data with n = 500 and p = 50, we use N = 8192 particles and 
we consider two settings of the degrees of freedom a = 1 and a = 4 with T = 450 and T = 350 
respectively, with the lower T for a = 4 explained by the low marginal likelihood associated with 
small values of b when a is large. We set bt = 2(0.98)*"^ for t G {1, . . . ,r}. Given the nature 
of the posterior, a simple ^-dimensional random walk Metroplis-Hastings kernel does not seem 
to converge quickly. We found that using a cycle of kernels that each update one element of (3 
using a random walk proposal with variance 0.25 mixed better than a p-dimensional random 
walk proposal. To ensure that our kernels mix fast enough to get good results we construct Kt 
by cycling this cycle of kernels 5 times. This is computationally demanding since each step of 
the algorithm consists of 5Np MCMC steps, each of which has a complexity in 0(n), since the 
calculation of the likelihood requires us to update X/3 for a change in one element of /3. 

In fact, computation on a 2.67GIIz Intel Xeon requires approximately 20 hours of computation 
time with these settings and T = 450. However, we implemented the algorithm to run on 
an NVIDIA 8800 GT graphics card using the Compute Unified Device Architecture (CUDA) 
parallel computing architecture, using the SMC sampler framework in [28] to take advantage of 
the ability to compute the likelihood in parallel for many particles. On this hardware, the time 
to run the algorithm is approximately 30 minutes for T = 450, giving around a 40 fold speedup. 



8 



The initial particles for t = 1 are obtained by running an MCMC algorithm targeting tti for 
a long time and picking suitably thinned samples to initialize the particles. The weights of 
each particle start off equal at While not ideal, trivial importance sampling estimates 

have a very low effective sample size for reasonable computational power and, in any case, the 
SMC sampler methodology requires us to have a rapidly mixing MCMC kernel for our resulting 
estimates to be good. 

The particle with the highest posterior density at each time is used as an initial value for the 
MAP algorithm described in Section 2.1. The density of the estimated MAP from this initial 
value is compared with the MAP obtained by using the previous MAP as an initial value and the 
estimate with the higher posterior density is chosen as the estimated MAP. This has the effect 
of both removing variables that the original algorithm has not removed and keeping variables 
in that the original algorithm has not, as it is not possible for the MAP algorithm to explore 
the multimodal posterior density fully. 

While the independence of the weights of the particles makes it possible to adapt the selection of 
bt allow aggressive changes in b while maintaining a given effective sample size [29, 30], we found 
that this scheme had a poor effect on the quality of our samples. This is primarily because we 
discovered that our cycle length for Kf was not large enough to get the particles to approximate 
TTt after a large change in 6, despite a small change in effective sample size. Increasing the cycle 
length has an essentially linear effect on the computational resources required, so this appears 
to be a poor option for reducing the computational complexity of the method. In addition, 
having bt on a logarithmic schedule enables us to use the simple approximation to the posterior 
of /3|X,y,a given by (9). 

For the larger experiment with n = 1859, p = 184 and a = 4, we use the same parameters except 
that we used the schedule bt = (0.98)*"^ with T = 250. This was sufficient to capture the a 
wide range of model complexities. Nevertheless, the running time for this experiment on our 
GPU was 7.5 hours. This demonstrates the key practical benefit to using CPUs, as a similar 
run on a single CPU would take upwards of 5 days. 

4.2 Tuning the SMC sampler 

Following exploratory analysis we found that a sequence of scales as bt = &i(0.98)*~-'^ worked 
well, and that setting bi equal to one or two produced a diffuse enough prior when sample size 
was in the 100s or 1000s, as is the situation in GWAS. 

In practice, one will have to experiment with the parameters of the SMC sampler in order to 
ensure satisfactory results. In particular, given an MCMC kernel Kt^ one must decide how 
many cycles should be performed per time step to mix adequately in relation to how quickly 
one wishes to decrease b. In an ideal scenario, Kt produces essentially independent samples 
from TTt regardless of the current set of particles. This would allow one to set {bt}f=i such that 
computational power is only spent in areas of b that are of interest. In practice, however, Kt will 
often only mix well locally and so small steps in b are desirable to ensure that the Monte Carlo 
error in the scheme does not dominate the results. It is difficult to characaterize the relationship 
between the number of cycles of K to perform and how aggressively one can decrease b in general. 
However, in this context, one might not want to decrease b too quickly anyway, so that a smooth 
picture of the relationship between successive posteriors can be visualized. 

The number of particles, is a crucial parameter that should be set at such a level that some 
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fraction of N can provide a suitable approximation of each intermediate target density. In 
practice, the number of effective samples is lower than N and the effective sample size is an 
attempt to ensure that this number is controlled. It is standard practice to choose to resample 
particles when the effective sample size is in (2/3, 1)A/'. We choose to resample the particles 
when the effective sample size fell below 3 /AN. 

A recommended way to test whether the sampler is performing acceptably is to run an MCMC 
scheme with fixed b = bt for some intermediate value of t and compare the samples one obtains 
after convergence with those produced by the SMC sampler. This is possible in this context 
because the posterior is not particularly troublesome to sample from for fixed b with a sensible 
kernel. 

Finally it is worth noting that if the aim was to infer a single distribution given a single setting 
of c, one could achieve better results via one long MCMC run. The benefit of the SMC sampler 
is that it provides a unified method to obtain all of the exploratory statistics and estimates of 
the marginal likelihood in a single run. 



5 Sparsity-Path- Analysis 

Lasso plots of MAP estimates as functions of the penalty parameter A are a highly informative 
visual tool for statisticians to gain an understanding of the relative predictive importance of 
covariates over a range of model complexities [31]. However, MAP statistics are single point 
estimates that fail to convey the full extent of information contained within the joint posterior 
distribution. One key aim of this report is to develop full Bayesian exploratory tools that utilize 
the joint posterior information. 

In this section we describe graphical displays of summary statistics contained in SPA results 
by first considering the n = 500 p = 50 data set described above. In particular the in- 
dices of the variable with non-zero /3's are I = {10,14,24,31,37} with corresponding /3x = 
{-0.2538, 0.4578, -0.1873, -0.1498, 0.0996}. In the Gt{a, c) prior we found the Gt{a = 1, c), or 
Gt{a = 4, c), to be good default settings for the degrees of freedom both leading to heavy tails. 
We ran the SMC-GPU sampler described above and then generated a number of plots using 
the set of particles and their weights. 



5.1 Path plots 

In Fig. 4 we plot four graphs of summary statistics obtained from the SMC sampler across a 
range of scales log(c) G (—8,0). 

(a) MAP plots: In Fig. 4(a) we plot the MAP paths of all 50 coefficients moving from the most 
sparse models with c = where all /3map = through to c = 1 when all /3map 7^ 0. 
The true non-zero coefficients are plotted in colors other than gray. One can observe that 
the MAP paths are non-smooth in log(c) such that the modes suddenly jump away from 
zero. This is a property of the non-log-concavity of the posterior distributions; such that 
the marginal posterior densities can be bimodal with one mode at f3j = and another 
away from zero. As c increases the MAP mode at zero decreases and at some point the 
global mode switches to the mode at 7^ 0. This can be seen clearly in Fig. 5 were we 



10 



plot the marginal posterior distribution of ^24 over the range logc G (—5,4); c.f. ^24 is 
shown as as the red curve in Fig. 4(a). We can see in Fig. 5 the non-concavity of the 
posterior density and how the global mode jumps from to —.04 as logc is around 
—4.75. As mentioned above the MAP is found by iterating the EM algorithm beginning 
from the particle with largest posterior density. 

(b) Median plots: In Fig. 4(b) we plot the absolute value of the median of the marginal 
distribution of /3j's. This is a plot of Zj(log[c]) vrs logc, for j = 1, . . . ,p, where, 

where F^^{') is the inverse cumulative posterior distribution of 

/X 
^(/3j|X,y,a,c)(i/3j 
-00 

and where 7r(/3j |X, y , a, c) is the marginal posterior distribution of /3j given c; 7r(;Sj |X, y , a, c) = 
J_^. 7r(/3j, /3_j|X, y, a, c)(i/3_j where index {—j} indicates all indices other than the jth. 
The plot of absolute medians gives an indication of how quickly the posterior distribu- 
tions are moving outward away from the origin as the precision of the prior decreases. 
By plotting on the absolute scale we are better able to compare the coefficients with one 
another and we also see that, unlike the MAPs, the medians are smooth functions of c. 

(c) Posterior for scale c: In Fig. 4(c) we show the marginal posterior distribution p(c|X, y, a), 

p(c|X,y,a)= / p(c,/3|X,y,a)d/3. 

J/3 

The posterior on c graphs the relative evidence for particular prior scale. The mode of 
Fig. 4(c), 

c = argmax7r(c|X, y, a) 

c 

is indicated on plots (a),(b),(d) by a vertical dashed line. 

(d) Concentration plots: In Fig. 4(d) we plot the concentration of the marginal posteriors of 
/3j's around the origin. In particular, for user specified tolerance A, this is a plot of V{c) 
vrs c where, 



V{c) ^l- r 7r(/3,|X,y,a,c)d/3, = 1 - Pr(/3, G (-A, A)|X,y,a,c) 

This is a direct measure of the predictive relevance of the corresponding covariate (geno- 
type). In Fig. 4(d) we have set A = 0.1 although this is a user set parameter that 
can be specified according to the appropriate level below which a variable is not deemed 
important. 



Taken together we see the SPA plots highlight the influential genotypes and the relative evidence 
for their predictive strength. We can also gain an understanding of the support in the data for 
differing values of log(c). Having generated the SPA plots it is interesting to compare the results 
of changing the degrees of freedom from the Gt{a — 4, c) to Gt{a — l,c). In Fig. 6 we show 
plots corresponding to Fig. 4 but using a = 1 degrees of freedom. We can see that, as expected, 
Gt{a = l,c) produces sparser solutions with only three non-zero MAP estimates at the mode 
of the posterior for c. Moreover, comparing the concentration plots Fig. 6(d) with Fig. 4(d) at 
the marginal MAP estimate of c we see that for a = 1 we see much greater dispersion in the 
concentration plot. 
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5.2 Individual coefficient plots 



Examination of the plots in Fig. 4 may highlight to the statistician some interesting predictors 
to explore in greater detail. Individual coeffcient plots of summary statistics can be produced to 
provide greater information on the posterior distributions. In Fig. 7 we show summary plots for 
four representative coefficients with their 90% credible intervals (green), median (black), mean 
(blue) and MAP (red). These are obtained from the set of weighted SMC particles. We can see 
that as expected the mean and median are smooth functions of the prior scale, which the MAP 
can exhibit the characteristic jumping for bimodal densities. In Fig. 8 we show corresponding 
density estimates. These coefficients were chosen to represent markers with strong association 
Fig. 7(a), weaker association Fig. 7(b), (c) and no association Fig. 7(d). We can see in the plots 
for Fig. 7(d) and Fig. 8(d) that for a null marker with no association signal the MAP appears 
to be much smother in log(c). 

Equivalent plots but for a = 1 are shown in Fig. 9 and Fig. 10 where we see the greater sparsity 
induced by the heavier tails of the GT{a = 1, c) relative to GT{a = 4, c). 

5.3 Marginal plots 

The SMC sampler also allows us to estimate the marginal posterior probability of /3, using (9), 
integrating over the uncertainty in c. 



Moreover we can also calculate the marginal posterior concentration away from zero, for given 
tolerance A as. 



In Fig. 11 we plot summaries for the marginal posterior distributions of j3 as well as the marginal 
concentrations, for a = 4 and a — 1. We can see in Fig. 11 that the marginal plots provide a 
useful overview of the relative importance of each variable. 

5.4 Comparison to double-exponential prior, a ^ oo 

It is informative to compare the results for the Gt(a, c) prior above with that of the double- 
exponential prior oc exp(— /3/c) which is obtained as a generalized t prior with g = 1 and 
a ^ CO. In Fig. 12 we show SPA plots for this case. We can see the much smoother paths of 
the MAP compared with Fig 4. This can also be seen in the individual coefficient plots shown 
in Fig. 13 and Fig. 14. It is interesting to investigate in more detail the posterior density of 
/324 the coefficient with strongest evidence of association. This is shown in Fig. 15 and should be 
compared to Fig. 5 for a = 4. We can clearly see that the heavier tails of a = 4 induces greater 
sparsity and a rapid transition from the posterior concentrated around zero to being distributed 
away from zero. 
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5.5 Large Data Set 



We next analysed the larger data set with n = 1859 and p — 184. The indices of the 
non-zero cofficients are I = {108,22,5,117,162} with the same values retained for /3z = 
{-0.2538,0.4578,-0.1873,-0.1498,0.0996}. The SPA plots are shown in Fig.l6 with corre- 
sponding individual coefficient plots in Fig. 17 and Fig. 18. We can see in Fig. 16 that there is 
much stronger evidence that coefficients {108,22,5,117} are important. Interestingly variable 
162 is still masked by other predictors. In comparison with Fig. 4(c) we observe that a smaller 
value of c is supported by the larger data set. 

From Fig. 16(d) we can see that one null-coefficient, /3io7, (with true /3io7 = 0) appears to 
have some evidence of predictive importance. In Fig. 19 we show summaries of the marginal 
distributions and concentration plots having integrated over the uncertainty in c. The null- 
coefficient SNP turns out to be situated adjacent to a predictive SNP /3io8 = —0.2538 shown 
as the blue line in Fig. 16 (a), and the two markers are in high LD with correlation coefficient 
(7(Xio7, Xios) = 0.87, leaving some ambiguity in the source of the association signal. The plot of 
absolute medians Fig. 16(b) helps partly resolve this issue pointing to only four clear association 
signals around the mode c of the marginal probability. We can drill down a little further and 
plot summary statistics for the posterior distribution for /3io7 shown in Fig. 20. In Fig. 20 we 
can see that the MAP (red line) holds at zero for almost all values of c, adding support that 
this is a "null" marker, although the credible intervals (green lines) show there is considerable 
uncertainty in the actual value. In such a way we can see how complex association signal can 
be explored, gaining a better understanding of the source of association signal. 



6 Discussion 

We have presented an exploratory approach using generalized t priors that we call Bayesian 
sparsity-path-analysis to aid in the understanding of genetic association data. The approach 
involves graphical summaries of posterior densities of coefficients obtained by sweeping over a 
range of prior precisions, including values with low posterior probability, in order to characterize 
the space of models from the sparse to the most complex. 

This use of SMC methodology is ideally suited to the inference task, by indexing the SMC on 
the scale of the prior. The resulting collection of weighted particles provides us with approxi- 
mations for the coefficient posterior distributions for each scale in addition to estimates of the 
marginal likelihood and allows for improved robustness in MAP computation. The simulations 
are computationally demanding and would take days worth of run-time using conventional 
single-threaded CPU processing. To alleviate this we make use of many-core GPU parallel 
processing producing around a 20-40 fold improvement in run-times. This has real benefit in 
allowing for data analysis within a working day for the larger data set we considered. 
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Figures 




X 



Figure 1: Plot of log probability under the Gt{l, 1) (solid line) and the Lasso prior DE{0, 1) 
(dashed line). 




Figure 2: Estimate of the posterior mean under a single normal observation, for the Gt{a = 
1,6 — 0.1) (blue dashed) and Gt{a = 4, 0.1) (red dashed). We can see the sparsity by the setting 
to zero around ^ = and the approximate unbiasedness with reduction in shrinkage as |y| >> 0. 
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Figure 3: Plot of correlation structure across the first 100 adjacent genotype markers from 
chromosome 18q. 
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Figure 4: SPA plots with a = 4 degrees of freedom; using 50 markers around 18q. The 5 non- 
zero coefficients are indicated by non-grey lines. The vertical dashed line indicates the mode of 
the marginal likelihood, 7r(c|X,y,a), for c as shown in plot (c). 
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Figure 5: Plot of marginal posterior of /324 as a function of c, c.f. the red curve in Fig. 4(a). We 
can see the bimodality and hence the jumping of the MAP mode in Fig. 4(a) as logc changes 
from (-5, -4). 




(a) MAPs 





(b) absolute medians 




(c) posterior density 



-6 -4 -2 
log(c) 

(d) concentrations 



Figure 6: SPA plots with a = 1 degrees of freedom; using 50 markers around 18q. The 5 non- 
zero coefficients are indicated by non-grey lines. The vertical dashed line indicates the mode of 
the marginal likelihood, 7r(c|X, y, a), for c as shown in plot (c). This Figure should be compared 
with Fig. 4 using a = 4 degrees of freedom. 
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Figure 7: Stats for individual coefficients from Fig. 4 with a = 4. For each coefficient we plot 
90% credible intervals (green), median (black), mean (blue) and MAP (red). 




(c) a = 4, j = 31 (d) a = 4, j = 1 

Figure 8: Posterior density plots corresponding to Fig. 7, a = 4 
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(a) a = 1, j = 14 
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Figure 9: Stats for individual coefficients from Fig. 6 with a = 1. For each coefficient we plot 
90% credible intervals (green), median (black), mean (blue) and MAP (red). Compare with 
Fig. 7 where a = 4. 




(c) a = 1, i = 31 (d) a = 1, i = 1 



Figure 10: Posterior density plots corresponding to Fig. 9, a = 1 
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Figure 11: Marginal plots: (a) the summary stats from marginal posterior distributions showing 
MAPs (crosses), Medians (stars), and 90% credible intervals (bars) for a = 4 degrees of freedom; 
(b) the marginal concentrations A = 0.05 red bars, A = 0.1 (blue bars) for a = 4; (c) marginal 
posteriors as in (a) but with a = 1 prior; (d) marginal concentrations as in (b) but for a = 1, 
prior. 
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(a) MAPs 





(c) marginal density (d) concentrations 

Figure 12: SPA plots for the double-exponential prior, Gt{a oc, c) 
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(c) double-exponential j = 31 
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Figure 13: Stats for individual coefficients from Fig. 12 with double-exponential prior. For each 
coefficient we plot 90% credible intervals (green), median (black), mean (blue) and MAP (red). 
Compare with Fig. 7 where a = 4. 
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Figure 15: Plot of marginal posterior of /324 as a function of c for double-exponential prior 
a ^ oo. Under the double-exponential prior, a ^ oc, we see the posterior remains log-concave 
and hence the MAP is a smooth function of logc; as compared to Fig. 5. 
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Figure 16: SPA plots using the larger data set n — 1859, p — 184 
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Figure 17: Summary statistics for posterior distributions for individual coefficients, 90% credible 
intervals (green), median (black), mean (blue) and MAP (red). 
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(c) a-4,j = 117 (d) a = 4,j = 115 



Figure 18: Posterior density plots of coefficients in Fig. 17 




Figure 19: Marginal plots: (a) the summary stats from marginal posterior distributions showing 
MAPs (crosses), Medians (stars), and 90% credible intervals (bars) for a = 4 degrees of freedom; 
(b) the marginal concentrations A = 0.05 red bars, A = 0.1 (blue bars) for a = 4 
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Figure 20: Stats for individual coefficient /3io7 showing 90% credible intervals (green), median 
(black), mean (blue) and MAP (red). 
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